Yet Another Language Identifier
Abstract
Language identification of written text has been studied for several decades. Nevertheless, most research focuses on a few of the most widely spoken languages, while minor languages are ignored. Identifying a larger number of languages introduces difficulties that do not arise with only a few languages, and these difficulties reduce accuracy. The objective of this paper is to investigate the sources of this degradation. To isolate the impact of individual factors, five different algorithms and three different numbers of languages are used. The Support Vector Machine algorithm achieved an accuracy of 98% for 90 languages, and the YALI algorithm, which is based on a scoring function, achieved 95.4%. YALI is slightly less accurate but classifies around 17 times faster, and its training is more than 4000 times faster. Three data sets with varying numbers of languages and sample sizes were prepared to overcome the lack of standardized data sets. These data sets are now publicly available.
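The abstract does not spell out YALI's scoring function. As a rough illustration of what a scoring-function identifier can look like, the sketch below assumes character trigram profiles and a sum of log relative frequencies as the score. These choices, the toy training sentences, and all function names (`char_ngrams`, `train`, `score`, `identify`) are assumptions made for illustration, not the method from the paper.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text (lowercased, space-padded)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one relative-frequency n-gram profile per language.

    samples maps a language code to a list of training strings.
    """
    profiles = {}
    for lang, texts in samples.items():
        counts = Counter(g for t in texts for g in char_ngrams(t, n))
        total = sum(counts.values())
        profiles[lang] = {g: c / total for g, c in counts.items()}
    return profiles


def score(text, profile, n=3, floor=1e-6):
    """Sum log relative frequencies; unseen n-grams get a small floor (assumed value)."""
    return sum(math.log(profile.get(g, floor)) for g in char_ngrams(text, n))


def identify(text, profiles, n=3):
    """Return the language whose profile gives the highest score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang], n))


# Toy training data, purely illustrative; a real system needs far more text.
samples = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "this is the house that they built"],
    "de": ["der schnelle braune fuchs springt über den faulen hund",
           "das ist das haus und der garten"],
}
profiles = train(samples)
print(identify("the dog is in the house", profiles))
print(identify("das haus und der hund", profiles))
```

Training here is a single counting pass over the text and classification is one dictionary lookup per n-gram, which gives some intuition for why a scoring approach can be much cheaper than fitting an SVM, although the speed figures in the abstract come from the paper's own implementation, not from this sketch.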
Similar resources
Culture and Language Education
There are different views on the relationship between language and culture. Some consider them separate entities, one being a code system and the other a system of beliefs and attitudes. Some believe in a cause-and-effect relationship between the two, and yet others argue for a co-evolutionary mode of interrelation. This paper will subscribe to the Hallidayan co-evolutionary view of the relat...
A Scheme for Integrating the Top-Level Branch of the Traditional Iranian Medicine Cluster into the Structure of the Unified Medical Language System (UMLS) Metathesaurus
Background & Aim: The Unified Medical Language System (UMLS) is an extensive ontology of biomedical knowledge developed and maintained by the U.S. National Library of Medicine (NLM). Traditional Iranian Medicine (TIM) has no place in the structure of the UMLS Metathesaurus. The main aim of this study was to design a scheme for mapping the top-level branch of the TIM cluster into the structure of the Metathesaurus o...
Yet Another Application of the Theory of ODE in the Theory of Vector Fields
In this paper we define the θ-vector field on the n-surface S and then investigate the existence and uniqueness of its integral curves using the theory of ordinary differential equations. The subject is then illustrated through some examples.
Language identification on code-switching utterances using multiple cues
A code-switching utterance contains two or more languages. Usually, the switch occurs at the clause or word level. In this paper, a two-stage framework consisting of a language identifier followed by a speech recognizer is proposed and evaluated on Mandarin-Taiwanese code-switching utterances. In the language identifier, we use multiple cues including acoustic, prosodic and p...
Covenant, Promise, and the Gift of Time
If we categorize religions according to whether they give greater prominence to time or to space, the role of “promise” marks a religion of covenant as clearly a religion of time. Yet the future is unknowable and can only be present to us as a field of possibilities. How far do these possibilities extend? The question directs us back to the nature of time, a question that became concealed in th...